Redcedar Data Analyses Instructions
Please note this analysis and R Markdown document are still in development :)
The overall approach is to combine empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.
The steps for wrangling the data are described here.
Data were subset to include only GPS information for use in collecting ancillary data.
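As a minimal sketch of that subset step (the column names here are assumptions; the actual iNat export may use different field names):

```r
# Hypothetical iNat export with observation and location columns
inat <- data.frame(
  id        = 1:3,
  latitude  = c(48.76, 47.61, 45.52),
  longitude = c(-122.48, -122.33, -122.68),
  symptoms  = c("Healthy", "Thinning Canopy", "Dead Top")
)

# Keep only the GPS fields needed as input for ClimateNA
gps <- inat[, c("id", "latitude", "longitude")]
```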
Climate data were then downloaded for the iNat GPS locations using the ClimateNA tool (version 7.40).

Variables
Note that the analysis below uses the iNat data with 1510 observations. Amazing!
Remove specific climate variables that are not useful as explanatory variables (e.g., norm_Latitude).
Normals data for 265 variables were downloaded for each point:

- Monthly - 180 variables represented data averaged over months for the 30-year period
- Seasonal - 60 variables represented data averaged over 3-month seasons (4 seasons) for the 30-year period
- Annual - 20 variables represented data averaged over all years in the 30-year period
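Under ClimateNA's naming convention, monthly variables carry a 01-12 suffix, seasonal variables a _wt/_sp/_sm/_at suffix, and annual variables neither, so the columns can be split with pattern matching. The `norm_` prefix and example names below are assumptions based on the variable names used later in this document:

```r
# Example column names following the ClimateNA naming convention
vars <- c("norm_Tave01", "norm_PPT12",   # monthly (01-12 suffix)
          "norm_CMI_wt", "norm_PPT_sm",  # seasonal (_wt/_sp/_sm/_at suffix)
          "norm_MAT", "norm_DD_18")      # annual (no month/season suffix)

monthly  <- grep("(0[1-9]|1[0-2])$", vars, value = TRUE)
seasonal <- grep("_(wt|sp|sm|at)$", vars, value = TRUE)
annual   <- setdiff(vars, c(monthly, seasonal))
```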
Remove variables that have near-zero standard deviations (i.e., the entire column is the same value).
Full

Dropping columns with near-zero standard deviation removed `length(normals) - length(normals.nearzerovar)` climate variables.
Monthly

Dropping columns with near-zero standard deviation removed `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly climate variables.
Seasonal

Dropping columns with near-zero standard deviation removed `length(normals.seasonal) - length(normals.seasonal.nearzerovar)` seasonal climate variables.
Annual

Dropping columns with near-zero standard deviation removed `length(normals.annual) - length(normals.annual.nearzerovar)` annual climate variables.
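The near-zero-standard-deviation filter can be sketched in base R as below (the `caret` package's `nearZeroVar()` is another common route; the toy data and column names here are illustrative, not the real normals):

```r
# Toy normals table: one constant column, two varying columns
normals <- data.frame(
  norm_MAT   = c(8.1, 9.3, 7.6, 8.8),
  norm_RH    = c(70, 70, 70, 70),     # zero standard deviation
  norm_DD_18 = c(410, 395, 460, 430)
)

# Keep only columns whose standard deviation is not (near) zero
sds <- sapply(normals, sd)
normals.nearzerovar <- normals[, sds > 1e-8, drop = FALSE]

length(normals) - length(normals.nearzerovar)  # 1 column removed
```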
Random forest models were then fit with the response classified two ways (five categories or binary) against each set of explanatory variables.
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 43.43%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 68 116 24 35 16 0.7374517
## Healthy 53 1027 83 80 32 0.1945098
## Other 19 156 63 30 9 0.7725632
## Thinning Canopy 32 153 31 85 8 0.7249191
## Tree is Dead 21 52 6 12 18 0.8348624
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.monthly, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 43.25%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 73 115 23 34 14 0.7181467
## Healthy 53 1034 79 78 31 0.1890196
## Other 19 159 56 31 12 0.7978339
## Thinning Canopy 32 157 28 84 8 0.7281553
## Tree is Dead 19 53 7 12 18 0.8348624
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.seasonal, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 43.52%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 69 117 25 33 15 0.7335907
## Healthy 55 1035 81 76 28 0.1882353
## Other 20 156 60 30 11 0.7833935
## Thinning Canopy 35 156 30 79 9 0.7443366
## Tree is Dead 20 53 6 14 16 0.8532110
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.annual, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 43.88%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 68 121 22 33 15 0.7374517
## Healthy 56 1025 85 78 31 0.1960784
## Other 23 160 58 28 8 0.7906137
## Thinning Canopy 31 158 27 82 11 0.7346278
## Tree is Dead 20 52 7 12 18 0.8348624
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 32.35%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 954 321 0.2517647
## Unhealthy 400 554 0.4192872
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 32.57%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 941 334 0.2619608
## Unhealthy 392 562 0.4109015
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 32.79%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 944 331 0.2596078
## Unhealthy 400 554 0.4192872
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 33.02%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 940 335 0.2627451
## Unhealthy 401 553 0.4203354
Summary of model performance
| Response | Explanatory | Vars tried at each split | OOB Error (%) |
| --- | --- | --- | --- |
| 5 class | Full | 14 | 43.43 |
| 5 class | Monthly | 12 | 43.25 |
| 5 class | Seasonal | 7 | 43.52 |
| 5 class | Annual | 4 | 43.88 |
| Binary | Full | 14 | 32.35 |
| Binary | Monthly | 12 | 32.57 |
| Binary | Seasonal | 7 | 32.79 |
| Binary | Annual | 4 | 33.02 |
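As a sanity check on the table, the OOB error rate is just the off-diagonal share of the confusion matrix; for the binary/full model above:

```r
# Confusion matrix from the binary / full-variable random forest
conf <- matrix(c(954, 321,
                 400, 554),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("Healthy", "Unhealthy"),
                               c("Healthy", "Unhealthy")))

# OOB error = misclassified observations / total observations
oob <- 1 - sum(diag(conf)) / sum(conf)
round(100 * oob, 2)  # 32.35, matching the reported OOB estimate
```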
Clearly, all of the climate variables are highly correlated.
Let's pick the top-performing variable in our random forest analyses, CMI, and then add any less-correlated variables.
Below we can check the correlation of CMI, MAP, and DD_18.
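A sketch of that check with base R's `cor()`. The data below are synthetic stand-ins for the real normals (the column names are assumed from the variable names used in this document), constructed so that MAP tracks CMI while DD_18 varies more independently:

```r
set.seed(42)

# Synthetic stand-ins for the three annual climate normals
n <- 100
norm_CMI   <- rnorm(n, mean = 30, sd = 10)
norm_MAP   <- 40 * norm_CMI + rnorm(n, sd = 300)  # moisture-related, correlated with CMI
norm_DD_18 <- rnorm(n, mean = 400, sd = 80)       # degree-days below 18 C, more independent

clim <- data.frame(norm_CMI, norm_MAP, norm_DD_18)

# Pairwise Pearson correlations among the candidate predictors
round(cor(clim), 2)
```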
Now we can check how the model performs with only these three climate variables.
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 33.02%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 940 335 0.2627451
## Unhealthy 401 553 0.4203354
It’s hard to give up the seasonality data, but the variables are all highly correlated (data not shown), and if we look at the importance plot for the seasonal data above, the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) all had the highest MeanDecreaseAccuracy and MeanDecreaseGini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter values for each variable.